Abstract

This paper presents a generalised two-level implementation which can handle linear and non-linear morphological operations. An algorithm for the interpretation of multi-tape two-level rules is described. In addition, a number of issues which arise when developing non-linear grammars are discussed with examples from Syriac.



1 Introduction 



The introduction of two-level morphology (Koskenniemi, 1983) and subsequent developments have made implementing computational-morphology models a feasible task. Yet, two-level formalisms fell short of providing elegant means for the description of non-linear operations such as infixation, circumfixation and root-and-pattern morphology.¹ As a result, two-level implementations - e.g. (Antworth, 1990; Karttunen, 1983; Karttunen and Beesley, 1992; Ritchie et al., 1992) - have always been biased towards linear morphology.

The past decade has seen a number of proposals for handling non-linear morphology;² however, none (apart from Beesley's work) seem to have been implemented over large descriptions, nor have they provided means by which the grammarian can develop non-linear descriptions using higher level notation.

* Supported by a Benefactor Studentship from St John's College. This research was done under the supervision of Dr Stephen G. Pulman. Thanks to the anonymous reviewers for their comments. All mistakes remain mine.

1 Although it is possible to express some classes of non-linear rules using standard two-level formalisms by means of ad hoc diacritics, e.g., infixation in (Antworth, 1990, p. 156), there are no means for expressing other classes such as root-and-pattern phenomena.

2 (Kay, 1987), (Kataja and Koskenniemi, 1988), (Beesley et al., 1989), (Lavie et al., 1990), (Beesley, 1990), (Beesley, 1991), (Kornai, 1991), (Wiebe, 1992), (Pulman and Hepple, 1993), (Narayanan and Hashem, 1993), and (Bird and Ellison, 1994). See (Kiraz, 1996) for a review.

To test the validity of one's proposal or formalism, minimally a medium-scale description is a desideratum. SemHe³ fulfils this requirement. It is a generalised multi-tape two-level system which is being used in developing non-linear grammars.

This paper (1) presents the algorithms behind SemHe; (2) discusses the issues involved in compiling non-linear descriptions; and (3) proposes extensions/solutions to make writing non-linear rules easier and more elegant. The paper assumes knowledge of multi-tape two-level morphology (Kay, 1987; Kiraz, 1994c).



2 Linguistic Descriptions 

The linguist provides SemHe with three pieces of data: a lexicon, two-level rules and a word formation grammar. All entries take the form of Prolog terms.⁴ (Identifiers starting with an uppercase letter denote variables, otherwise they are instantiated symbols.) A lexical entry is described by the term

    synword(⟨morpheme⟩, ⟨category⟩).

Categories are of the form

    ⟨category symbol⟩ : [⟨feature_attr_1 = value_1⟩,
                         ...,
                         ⟨feature_attr_n = value_n⟩]

a notational variant of the PATR-II category formalism (Shieber, 1986).

3 The name SemHe (Syriac semhe 'rays') is not an acronym, but the title of a grammatical treatise written by the Syriac polymath (inter alia mathematician and grammarian) Bar 'Ebroyo (1225-1286), viz. ktobo dsemhe 'The Book of Rays'.

4 We describe here the terms which are relevant to this paper. For a full description, see (Kiraz, 1996).



tl_alphabet(0, [k,t,b,a,e]).                          % surface alphabet
tl_alphabet(1, [c1,c2,c3,v,β]).                       % lexical alphabets (β marks the morpheme boundary)
tl_alphabet(2, [k,t,b,β]).
tl_alphabet(3, [a,e,β]).
tl_set(radical, [k,t,b]).                             % variable sets
tl_set(vowel, [a,e]).
tl_set(c1c3, [c1,c3]).

tl_rule(R1, [[],[],[]], [[β],[β],[β]], [[],[],[]], =>, [], [], [],
        [], [[],[],[]]).
tl_rule(R2, [[],[],[]], [[P],[C],[]], [[],[],[]], =>, [], [C], [],
        [c1c3(P),radical(C)], [[],[],[]]).
tl_rule(R3, [[],[],[]], [[v],[],[V]], [[],[],[]], =>, [], [V], [],
        [vowel(V)], [[],[],[]]).
tl_rule(R4, [[],[],[]], [[v],[],[V]], [[c2,v],[],[]], <=>, [], [], [],
        [vowel(V)], [[],[],[]]).
tl_rule(R5, [[],[],[]], [[c2],[C],[]], [[],[],[]], <=>, [], [C], [],
        [radical(C)], [[],[root:[measure=p'al]],[]]).
tl_rule(R6, [[],[],[]], [[c2],[C],[]], [[],[],[]], <=>, [], [C,C], [],
        [radical(C)], [[],[root:[measure=pa''el]],[]]).

Listing 1



A two-level rule is described using a syntactic variant of the formalism described by (Ruessink, 1989; Pulman and Hepple, 1993), including the extensions by (Kiraz, 1994c),

    tl_rule(⟨id⟩, ⟨LLC⟩, ⟨Lex⟩, ⟨RLC⟩, ⟨Op⟩,
            ⟨LSC⟩, ⟨Surf⟩, ⟨RSC⟩,
            ⟨variables⟩, ⟨features⟩).

The arguments are: (1) a rule identifier, id; (2) the left-lexical-context, LLC, the lexical center, Lex, and the right-lexical-context, RLC, each in the form of a list-of-lists, where the ith list represents the ith lexical tape; (3) an operator, => for optional rules or <=> for obligatory rules; (4) the left-surface-context, LSC, the surface center, Surf, and the right-surface-context, RSC, each in the form of a list; (5) a list of the variables used in the lexical and surface expressions, each member in the form of a predicate indicating the set identifier (see infra) and an argument indicating the variable in question; and (6) a set of features (i.e. category forms) in the form of a list-of-lists, where the ith item must unify with the feature-structure of the morpheme affected by the rule on the ith lexical tape.

A lexical string maps to a surface string iff (1) 
they can be partitioned into pairs of lexical-surface 
subsequences, where each pair is licenced by a rule, 
and (2) no partition violates an obligatory rule. 
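
The two conditions above can be sketched in a few lines of Python. This is a deliberately simplified, single-tape illustration and not SemHe's implementation: the `Rule` tuple, the function names and the literal right context `ta` are assumptions made for the example, and set variables, left contexts and surface contexts are left out.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Rule(NamedTuple):
    lex: str   # lexical centre (one tape only, unlike SemHe)
    surf: str  # surface centre
    rlc: str   # right lexical context; "" means unspecified
    op: str    # "=>" optional, "<=>" obligatory


# Hypothetical toy grammar: identity pairs plus obligatory deletion of /a/ before "ta".
RULES = [
    Rule("k", "k", "", "=>"),
    Rule("t", "t", "", "=>"),
    Rule("b", "b", "", "=>"),
    Rule("a", "a", "", "=>"),
    Rule("a", "", "ta", "<=>"),
]

Pair = Tuple[str, str, str]  # (lexical centre, surface centre, lexical remainder)


def partitions(lexical: str, surface: str) -> Iterator[List[Pair]]:
    """Condition (1): yield every split into rule-licensed lexical-surface pairs."""
    if not lexical and not surface:
        yield []
        return
    for r in RULES:
        if lexical.startswith(r.lex + r.rlc) and surface.startswith(r.surf):
            rest_lex = lexical[len(r.lex):]
            for tail in partitions(rest_lex, surface[len(r.surf):]):
                yield [(r.lex, r.surf, rest_lex)] + tail


def violates(pair: Pair) -> bool:
    """Condition (2): an obligatory rule matches the lexical side but not the surface."""
    lex, surf, rest_lex = pair
    return any(
        r.op == "<=>" and r.lex == lex and rest_lex.startswith(r.rlc) and r.surf != surf
        for r in RULES
    )


def maps(lexical: str, surface: str) -> bool:
    """True iff some partition satisfies both conditions."""
    return any(
        not any(violates(p) for p in part) for part in partitions(lexical, surface)
    )
```

With this toy grammar, `maps("katab", "ktab")` holds, whereas `maps("katab", "katab")` does not: the only partition of the latter pairs the first /a/ with a surface a, which the obligatory deletion rule forbids.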

Alphabet declarations take the form tl_alphabet(⟨tape⟩, ⟨symbol_list⟩), and variable sets are described by the predicate tl_set(⟨id⟩, ⟨symbol_list⟩). Word formation rules take the form of unification-based CFG rules, syn_rule(⟨identifier⟩, ⟨mother⟩, [⟨daughter_1⟩, ..., ⟨daughter_n⟩]).



The following example illustrates the derivation of Syriac /ktab/⁵ 'he wrote' (in the simple p'al measure)⁶ from the pattern morpheme {cvcvc} 'verbal pattern', root {ktb} 'notion of writing', and vocalism {a}. The three morphemes produce the underlying form */katab/, which surfaces as /ktab/ since short vowels in open unstressed syllables are deleted. The process is illustrated in (1).⁷

(1)   c v c v c
      */katab/ ⇒ /ktab/



The pa''el measure of the same verb, viz. /katteb/, is derived by the gemination of the middle consonant (i.e. t) and applying the appropriate vocalism {ae}.

The two-level grammar (Listing 1) assumes three lexical tapes. Uninstantiated contexts are denoted by an empty list. R1 is the morpheme boundary (= β) rule. R2 and R3 sanction stem consonants and vowels, respectively. R4 is the obligatory vowel deletion rule. R5 and R6 map the second radical, [t], for p'al and pa''el forms, respectively. In this example, the lexicon contains the entries in (2).⁸
(2) synword(c1vc2vc3, pattern:[]).
    synword(ktb, root:[measure=M]).
    synword(aa, vocalism:[measure=p'al]).
    synword(ae, vocalism:[measure=pa''el]).
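
The root entry in (2) leaves `measure` uninstantiated (the variable M), so its value has to come from whatever the entry is unified with. A minimal Python sketch of that idea follows, assuming flat feature lists only; the `unify` function and the use of `None` for an uninstantiated value are illustrative choices, and variables shared between several categories (as in a word grammar rule) are not modelled.

```python
from typing import Dict, Optional

Features = Dict[str, Optional[str]]  # None stands for an uninstantiated value


def unify(a: Features, b: Features) -> Optional[Features]:
    """Unify two flat feature lists; return the merged list, or None on a clash."""
    out = dict(a)
    for key, value in b.items():
        if out.get(key) is None:
            out[key] = value          # fill in a missing or uninstantiated feature
        elif value is not None and out[key] != value:
            return None               # two different instantiated values clash
    return out


root = {"measure": None}              # root:[measure=M]
rule_feature = {"measure": "p'al"}    # e.g. the feature carried by a rule such as R5

instantiated = unify(root, rule_feature)
```

Here `instantiated` carries `measure = p'al`; unifying it further with a vocalism marked `pa''el` fails, which is how incompatible measure combinations would be ruled out in this simplified picture.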



5 Spirantization is ignored here; for a discussion on Syriac spirantization, see (Kiraz, 1995).

6 Syriac verbs are classified under various measures (forms). The basic ones are: p'al, pa''el and 'af'el.

7 This analysis is along the lines of (McCarthy, 1981) based on autosegmental phonology (Goldsmith, 1976).

8 Spreading is ignored here; for a discussion, see (Kiraz, 1994c).

Note that the value of 'measure' in the root entry is uninstantiated; it is determined from the feature values in R5, R6 and/or the word grammar (see infra).

3 Implementation 

There are two current methods for implementing two-level rules (both implemented in SemHe): (1) compiling rules into finite-state automata (multi-tape transducers in our case), and (2) interpreting rules directly. The former provides better performance, while the latter facilitates the debugging of grammars (by tracing and by providing debugging utilities along the lines of (Carter, 1995)). Additionally, the interpreter facilitates the incremental compilation of rules by simply allowing the user to toggle rules on and off.

The compilation of the above formalism into automata is described by (Grimley-Evans et al., 1996). The following is a description of the interpreter.

3.1 Internal Representation 

The word grammar is compiled into a shift-reduce parser. In addition, a first-and-follow algorithm, based on (Aho and Ullman, 1977), is applied to compute the feasible follow categories for each category type. The set of feasible follow categories, NextCats, of a particular category Cat is returned by the predicate Follow(+Cat, -NextCats). Additionally, Follow(bos, NextCats) returns the set of category symbols at the beginning of strings, and eos ∈ NextCats indicates that Cat may occur at the end of strings.

The lexical component is implemented as character tries (Knuth, 1973), one per tape. Given a list of lexical strings, Lex, and a list of lexical pointers, LexPtrs, the predicate

    Lexical-Transitions(+Lex, +LexPtrs, -NewLexPtrs, -LexCats)

succeeds iff there are transitions on Lex from LexPtrs; it returns NewLexPtrs, and the categories, LexCats, at the end of morphemes, if any.
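
The one-trie-per-tape arrangement can be pictured with the following Python sketch. It is an assumption-laden illustration rather than SemHe's code: tries are nested dictionaries, the key `END` (a name chosen here) stores the categories of morphemes ending at a node, and failure is signalled by returning `None` instead of Prolog-style backtracking.

```python
from typing import Dict, List, Optional, Sequence, Tuple

END = None  # key under which a node stores the categories of morphemes ending there


def make_trie(entries: Sequence[Tuple[Sequence[str], str]]) -> Dict:
    """Build a trie from (symbol sequence, category) pairs."""
    root: Dict = {}
    for symbols, category in entries:
        node = root
        for sym in symbols:
            node = node.setdefault(sym, {})
        node.setdefault(END, []).append(category)
    return root


def lexical_transitions(
    lex: Sequence[Sequence[str]], lex_ptrs: Sequence[Dict]
) -> Optional[Tuple[List[Dict], List[List[str]]]]:
    """Advance each tape's pointer over its symbols.

    Returns (new pointers, categories found at morpheme ends), or None when
    some tape has no transition for its symbols.
    """
    new_ptrs, cats = [], []
    for symbols, node in zip(lex, lex_ptrs):
        for sym in symbols:
            node = node.get(sym)
            if node is None:
                return None
        new_ptrs.append(node)
        cats.append(node.get(END, []))
    return new_ptrs, cats


# One trie per lexical tape: pattern, root, vocalism (entries taken from (2)).
TRIES = [
    make_trie([(["c1", "v", "c2", "v", "c3"], "pattern")]),
    make_trie([(["k", "t", "b"], "root")]),
    make_trie([(["a", "a"], "vocalism"), (["a", "e"], "vocalism")]),
]
```

Calling `lexical_transitions([["c1"], ["k"], []], TRIES)` moves the first two pointers one symbol down their tries and reports no categories yet, while `lexical_transitions([[], ["k", "t", "b"], []], TRIES)` reaches the end of the root morpheme and reports its category on the second tape.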

Two-level predicates are converted into an internal representation: (1) every left-context expression is reversed and appended to an uninstantiated tail; (2) every right-context expression is appended to an uninstantiated tail; and (3) each rule is assigned a 6-bit 'precedence value' where every bit represents one of the six lexical and surface expressions. If an expression is not an empty list (i.e. context is specified), the relevant bit is set. In analysis, surface expressions are assigned the most significant bits, while lexical expressions are assigned the least significant ones. In generation, the opposite state of affairs holds. Rules are then reasserted in the order of their precedence value. This ensures that rules which contain the most specified expressions are tested first, resulting in better performance.
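
As an illustration of the precedence value, the sketch below computes the six bits in Python. The text fixes only which group (surface or lexical) receives the most significant bits; the order of the three expressions inside each group, and the names `specified` and `precedence`, are assumptions made for this example.

```python
def specified(expr) -> bool:
    """True unless the expression is empty (on every tape, for lexical expressions)."""
    return any(expr)


def precedence(llc, lex, rlc, lsc, surf, rsc, mode="analysis") -> int:
    """Return the 6-bit precedence value of a rule for the given mode."""
    lexical = [llc, lex, rlc]
    surface = [lsc, surf, rsc]
    ordered = surface + lexical if mode == "analysis" else lexical + surface
    value = 0
    for expr in ordered:  # the first expression ends up in the most significant bit
        value = (value << 1) | int(specified(expr))
    return value


EMPTY = [[], [], []]
# Lexical and surface expressions of R4 and R2 from Listing 1.
R4 = dict(llc=EMPTY, lex=[["v"], [], ["V"]], rlc=[["c2", "v"], [], []],
          lsc=[], surf=[], rsc=[])
R2 = dict(llc=EMPTY, lex=[["P"], ["C"], []], rlc=EMPTY,
          lsc=[], surf=["C"], rsc=[])

# Rules with the most specified expressions are tried first.
analysis_order = sorted([("R4", R4), ("R2", R2)],
                        key=lambda r: precedence(**r[1]), reverse=True)
```

Under these assumptions R2, which specifies a surface centre, is tried before R4 in analysis, whereas in generation R4, with its lexical centre and right context, outranks R2.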

3.2 The Interpreter Algorithm 

The algorithms presented below are given in terms of Prolog-like non-deterministic operations. A clause is satisfied iff all the conditions under it are satisfied. The predicates are depicted top-down in (3). (SemHe makes use of an earlier implementation by (Pulman and Hepple, 1993).)

(3)        Two-Level-Analysis
             /            \
        Partition        Coerce
                            |
                     Invalid-Partition



In order to minimise accumulator-passing arguments, we assume the following initially-empty stacks: ParseStack accumulates the category structures of the morphemes identified, and FeatureStack maintains the rule features encountered so far. ('+' indicates concatenation.)

Partition partitions a two-level analysis into sequences of lexical-surface pairs, each licenced by a rule. The base case of the predicate is given in Listing 2⁹ and the recursive case in Listing 3.

The recursive Coerce predicate ensures that no partition is violated by an obligatory rule. It takes three arguments: Result is the output of Partition (usually reversed by the calling predicate, hence, Coerce deals with the last partition first), PrevCats is a register which keeps track of the last morpheme category encountered, and Partition returns selected elements from Result. The base case of the predicate is simply Coerce([], _, []) - i.e., no more partitions. The recursive case is shown in Listing 4. CurrentCats keeps track of the category of the morpheme which occurs in the current partition. The invalidity of a partition is determined by Invalid-Partition (Listing 5).

Two-Level-Analysis (Listing 6) is the main predicate. It takes a surface string or lexical string(s) and returns a list of partitions and a morphosyntactic parse tree. To analyse a surface form, one calls Two-Level-Analysis(+Surf, -Lex, -Partition, -Parse). To generate a surface form, one calls Two-Level-Analysis(-Surf, +Lex, -Partition, -Parse).

9 For efficiency, variables appearing in left-context and centre expressions are evaluated after Lexical-Transitions since they will be fully instantiated then; only right-contexts are evaluated after the recursion.

Partition(SurfDone, SurfToDo, LexDone, LexToDo, LexPtrs, NextCats, Result)
    SurfToDo = [] &                      % surface string exhausted
    LexToDo = [[],[],...,[]] &           % all lexical strings exhausted
    LexPtrs = [rt,rt,...,rt] &           % all lexical pointers are at the root node
    eos ∈ NextCats &                     % end-of-string
    Result = [].                         % output: no more results

Listing 2

Partition(SurfDone, SurfToDo, LexDone, LexToDo, LexPtrs, NextCats,
          [ResultHead | ResultTail])
    there is tl_rule(Id, LLC, Lex, RLC, Op, LSC, Surf, RSC, Variables, Features) such that
        (Op = (=> or <=>), LexDone = LLC, SurfDone = LSC,
         SurfToDo = Surf + RSC and LexToDo = Lex + RLC) &
    Lexical-Transitions(Lex, LexPtrs, NewLexPtrs, LexCats) &
    push Features onto FeatureStack &                  % keep track of rule features
    if LexCats ≠ nil then                              % found a morpheme boundary?
        while FeatureStack is not empty                % unify rule and lexical features
            unify LexCats with (pop FeatureStack) &
        push LexCats onto ParseStack &                 % update the parse stack
        if LexCats ∈ NextCats then                     % get next category
            Follow(LexCats, NewNextCats)
    end if &
    ResultHead = Id/SurfDone/Surf/RSC/
                 LexDone/Lex/RLC/LexCats &
    NewSurfDone = SurfDone + reverse Surf &            % make new arguments
    NewSurfToDo = RSC &
    NewLexDone = LexDone + reverse Lex &
    NewLexToDo = RLC &
    Partition(NewSurfDone, NewSurfToDo,                % ... and recurse
              NewLexDone, NewLexToDo,
              NewLexPtrs, NewNextCats, ResultTail) &
    for all SetId(Var) ∈ Variables                     % check variables
        there is tl_set(SetId, Set) such that Var ∈ Set

Listing 3

Coerce([Id/LSC/Surf/RSC/LLC/Lex/RLC/LexCats | ResultTail], PrevCats,
       [Id/Surf/Lex | PartitionTail])
    if LexCats ≠ nil then
        CurrentCats = LexCats
    else
        CurrentCats = PrevCats &
    not Invalid-Partition(LSC, Surf, RSC, LLC, Lex, RLC, CurrentCats) &
    Coerce(ResultTail, CurrentCats, PartitionTail).

Listing 4

Invalid-Partition(LSC, Surf, RSC, LLC, Lex, RLC, Cats)
    there is tl_rule(Id, LLC, Lex, RLC, <=>, LSC, NotSurf, RSC, Variables, Features) such that
        NotSurf ≠ Surf &
        for all SetId(Var) ∈ Variables                 % check variables
            there is tl_set(SetId, Set) such that Var ∈ Set &
        unify Cats with Features &
        fail.

Listing 5

Two-Level-Analysis(?Surf, ?Lex, -Partition, -Parse)
    Follow(bos, NextCats) &
    Partition([], Surf, [[],[],...,[]], Lex, [rt,rt,...,rt], NextCats, Result) &
    Coerce(reverse Result, nil, Partition) &
    Shift-Reduce(ParseStack, Parse).

Listing 6

4 Developing Non-Linear Grammars 

When developing Semitic grammars, one comes across various issues and problems which normally do not arise with linear grammars. Some can be solved by known methods or 'tricks'; others require extensions in order to make developing grammars easier and more elegant. This section discusses these issues.

4.1 Linearity vs. Non-Linearity 

In Semitic languages, non-linearity occurs only in stems. Hence, lexical descriptions of stems make use of three lexical tapes (pattern, root & vocalism), while those of prefixes and suffixes use the first lexical tape. This requires duplicating rules when stating lexical constraints. Consider rule R4 (Listing 1). It allows the deletion of the first stem vowel by virtue of RLC (even if c2 was not indexed); hence /katab/ → /ktab/. Now consider adding the suffix {eh} 'him/it': /katab/+{eh} → /katbeh/, where the second stem vowel is deleted since deletion applies right-to-left; however, RLC can only cope with stem vowels. Rule R7 (Listing 7) is required. One might suggest placing constraints on surface expressions instead. However, doing so causes surface expressions to be dependent on other rules.

Additionally, Lex in R4 and R7 deletes stem vowels. Consider adding the prefix {wa} 'and': {wa} + /katab/ + {eh} → /wkatbeh/, where the prefix vowel is also deleted. To cope with this, two additional rules like R4 and R7 are required, but with Lex = [[V],[],[]].

We resolve this by allowing the user to write expansion rules of the form

    expand(⟨symbol⟩, ⟨expansion⟩, ⟨variables⟩).

In our example, the expansion rules in (4) are needed.



(4) expand(C, [[C],[],[]], [radical(C)]).
    expand(C, [[c],[C],[]], [radical(C)]).
    expand(V, [[V],[],[]], [vowel(V)]).
    expand(V, [[v],[],[V]], [vowel(V)]).

The linguist can then rewrite R4 as R8 (Listing 7), and expand it with the command expand(R8). This produces four rules of the form of R4, but with the following expressions for Lex and RLC:¹⁰

    Lex                RLC
    [[V1],[],[]]       [[C,V2],[],[]]
    [[V1],[],[]]       [[c,v],[C],[V2]]
    [[v],[],[V1]]      [[C,V2],[],[]]
    [[v],[],[V1]]      [[c,v],[C],[V2]]
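
The four Lex/RLC combinations can be reproduced with a short Python sketch. How SemHe selects expansions is not spelled out here, so the sketch rests on two assumptions inferred from the table: every symbol within one expression uses the same alternative (linear or three-tape), and a numbered variable such as V1 or V2 is expanded like its base symbol V. The names `EXPANSIONS`, `expand_expr` and `expand_rule` are hypothetical.

```python
from itertools import product

# Alternatives in the order of (4): first the linear (first tape only) form,
# then the three-tape stem form.
EXPANSIONS = {
    "C": [[["C"], [], []], [["c"], ["C"], []]],
    "V": [[["V"], [], []], [["v"], [], ["V"]]],
}


def expand_expr(expr, alt):
    """Expand a flat expression such as ['C', 'V2'] into three tapes,
    using alternative number alt for every symbol in it."""
    tapes = [[], [], []]
    for sym in expr:
        base = sym.rstrip("0123456789")  # V1, V2 -> V (assumed naming scheme)
        for tape, piece in zip(tapes, EXPANSIONS[base][alt]):
            # keep the numbered variable name where the expansion mentions the symbol
            tape.extend(sym if p == base else p for p in piece)
    return tapes


def expand_rule(lex, rlc):
    """Return every (Lex, RLC) pair; the two expressions vary independently."""
    return [
        (expand_expr(lex, i), expand_expr(rlc, j))
        for i, j in product(range(2), repeat=2)
    ]


expanded_r8 = expand_rule(["V1"], ["C", "V2"])
```

`expanded_r8` then holds four pairs in the same order as the rows listed above.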

4.2 Vocalisation 

Orthographically, Semitic texts are written without short vowels. It was suggested by (Beesley et al., 1989, et seq.) and (Kiraz, 1994c) to allow short vowels to be optionally deleted. This, however, puts a constraint on the grammar: no surface expression can contain a vowel, lest the vowel be optionally deleted.

We assume full vocalisation in writing rules. A second set of rules can allow the deletion of vowels. The whole grammar can be taken as the composition of the two grammars: e.g. {cvcvc},{ktb},{aa} → /ktab/ → [ktab, ktb].
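
The second grammar in this composition can be pictured as a function from a fully vocalised surface form to its written variants. The Python sketch below assumes that each short vowel is deleted independently and that the vowel set is {a, e} as in Listing 1; the function name is illustrative only.

```python
from itertools import product
from typing import List


def optional_vowel_deletion(surface: str, vowels: str = "ae") -> List[str]:
    """Return every form obtained by optionally deleting each short vowel,
    fully vocalised form first."""
    choices = [(ch, "") if ch in vowels else (ch,) for ch in surface]
    forms: List[str] = []
    for combo in product(*choices):
        form = "".join(combo)
        if form not in forms:  # different deletions may yield the same string
            forms.append(form)
    return forms


variants = optional_vowel_deletion("ktab")
```

For /ktab/ this gives the two forms shown above, the vocalised ktab and the unvocalised ktb.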

4.3 Morphosyntactic Issues 

Finite-state models of two-level morphology implement morphotactics in two ways: using continuation patterns/classes (Koskenniemi, 1983; Antworth, 1990; Karttunen, 1993) or unification-based grammars (Bear, 1986; Ritchie et al., 1992). The former fails to provide elegant morphosyntactic parsing for Semitic languages, as will be illustrated in this section.



4.3.1 Stems and X-bar Theory

A pattern, a root and a vocalism do not always produce a free stem which can stand on its own. In Syriac, for example, some verbal forms are bound: they require a stem morpheme which indicates the measure in question, e.g. the prefix {?a} for af'el stems. Additionally, passive forms are marked by the reflexive morpheme {?et}, while active forms are not marked at all.

10 Note, however, that the expand command does not insert β randomly in context expressions.

tl_rule(R7, [[],[],[]], [[v],[],[V]], [[c3,β,e],[],[]], <=>, [], [], [],
        [vowel(V)], [[],[],[]]).
tl_rule(R8, [], [V1], [C,V2], <=>, [], [], [],
        [vowel(V1),vowel(V2),radical(C)], [[],[],[]]).

Listing 7

synrule(rule1, stem:[X=-2,measure=M,measure=p'al|pa''el],
        [pattern:[], root:[measure=M,measure=p'al|pa''el],
         vocalism:[measure=M,measure=p'al|pa''el]]).
synrule(rule2, stem:[X=-2,measure=M],
        [stem_affix:[measure=M],
         pattern:[], root:[measure=M], vocalism:[measure=M]]).
synrule(rule3, stem:[X=-1,measure=M,mood=act],
        [stem:[X=-2,measure=M,mood=act]]).
synrule(rule4, stem:[X=-1,measure=M,mood=pass],
        [reflexive:[], stem:[X=-2,measure=M,mood=pass]]).
synrule(rule5, stem:[X=0,measure=M,mood=MD,npg=s&3&m],
        [stem:[X=-1,measure=M,mood=MD]]).
synrule(rule6, stem:[X=0,measure=M,mood=MD,npg=NPG],
        [stem:[X=-1,measure=M,mood=MD], vim:[type=suff,circum=no,npg=NPG]]).
synrule(rule7, stem:[X=0,measure=M,mood=MD,npg=NPG],
        [vim:[type=pref,circum=no,npg=NPG], stem:[X=-1,measure=M,mood=MD]]).
synrule(rule8, stem:[X=0,measure=M,mood=MD,npg=NPG],
        [vim:[type=pref,circum=yes,npg=NPG], stem:[X=-1,measure=M,mood=MD],
         vim:[type=suff,circum=yes,npg=NPG]]).

Listing 8

This structure of stems can be handled hierarchically using X-bar theory. A stem whose stem morpheme is known is assigned X=-2 (Rules 1-2 in Listing 8). Rules which indicate mood can apply only to stems whose measure has been identified (i.e. they have X=-2). The resulting stems are assigned X=-1 (Rules 3-4 in Listing 8). The parsing of Syriac /?etkteb/ (from {?et}+/kateb/ after the deletion of /a/ by R4) appears in (5).¹¹

(5)            stem:[X=-1]
              /           \
        reflexive      stem:[X=-2]
                      /     |     \
                pattern   root   vocalism
                 cvcvc



Now free stems which may stand on their own can be assigned X=0. However, some stems require verbal inflectional markers.

11 In the remaining examples, it is assumed that the lexicon and two-level rules are expanded to cater for the new material.

4.3.2 Verbal Inflectional Markers 

With respect to verbal inflectional markers (VIMs), there are various types of Semitic verbs: those which do not require a VIM (e.g. sing. 3rd masc.), and those which require a VIM in the form of a suffix (e.g. perfect), prefix (e.g. some imperfect forms), or circumfix (e.g. other imperfect forms).

Each VIM is lexically marked inter alia with two features: 'type', which states whether it is a prefix or a suffix, and 'circum', which denotes whether it is a circumfix. Rules 5-8 (Listing 8) handle this.

The parsing of Syriac /netkatbun/ (from {ne}+{?et}+/katab/+{un}) appears in (6).

(6)                     stem:[X=0]
            /               |               \
          vim          stem:[X=-1]          vim
           |           /         \           |
          ne     reflexive    stem:[X=-2]    un
                     |        /    |    \
                    ?et  pattern  root  vocalism
                          cvcvc   ktb     aa



Verb Class      Inflections   1st Analysis   Subsequent Analysis   Mean
                Analysed      (sec/word)     (sec/word)            (sec/word)
Strong          78            5.053          0.028                 2.539
Initial nun     52            6.756          0.048                 3.404
Initial alaph   57            4.379          0.077                 2.228
Middle alaph    67            5.107          0.061                 2.584
Overall mean    63.5          5.324          0.054                 2.689

Table 1



(Beesley et al., 1989) handle this problem by finding a logical expression for the prefix and suffix portions of circumfix morphemes, and use unification to generate only the correct forms - see (Sproat, 1992, p. 158). This approach, however, cannot be used here since, unlike Arabic, not all Syriac VIMs are in the form of circumfixes.

4.3.3 Interfacing with a Syntactic Parser

A Semitic 'word' (string separated by word boundary) may in fact be a clause or a sentence. Therefore, a morphosyntactic parsing of a 'word' may be a (partial) syntactic parsing of a sentence in the form of a (partial) tree. The output of a morphological analyser can be structured in a manner suitable for syntactic processing. Using tree-adjoining grammars (Joshi, 1985) might be a possibility.



5 Performance 

To test the integrity, robustness and performance of the implementation, a two-level grammar of the most frequent words in the Syriac New Testament was compiled based on the data in (Kiraz, 1994b). The grammar covers most classes of verbal and nominal forms, in addition to prepositions, proper nouns and words of Greek origin. A wider coverage would involve enlarging the lexicon (currently there are 165 entries) and might triple the number of two-level rules (currently there are c. 50 rules).

Table 1 provides the results of analysing verbal classes. The test for each class represents analysing most of its inflexions. The test was executed on a Sparc ELC computer.

By constructing a corpus which consists only of the most frequent words, one can estimate the performance of analysing the corpus as follows,



\[
P = \frac{5.324\,n + \sum_{i=1}^{n} 0.054\,(f_i - 1)}{\sum_{i=1}^{n} f_i} \quad \text{sec/word}
\]



where n is the number of distinct words in the corpus and f_i is the frequency of occurrence of the ith word. The SEDRA database (Kiraz, 1994a) provides such data. All occurrences of the 100 most frequent lexemes in their various inflections (a total of 72,240 occurrences) can be analysed at the rate of 16.35 words/sec. (Performance will be less if additional rules are added for larger coverage.)

The results may not seem satisfactory when compared with other Prolog implementations of the same formalism (cf. 50 words/sec. in (Carter, 1995)). One should, however, keep in mind the complexity of Syriac morphology. In addition to morphological non-linearity, phonological conditional changes - consonantal and vocalic - occur in all stems, and it is not unusual to have more than five such changes per word. Once developed, a grammar is usually compiled into automata, which provide better performance.
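
The estimate can be evaluated directly from a frequency list: the first occurrence of each distinct word costs the mean first-analysis time from Table 1 and every later occurrence costs the mean subsequent-analysis time. The Python sketch below assumes exactly that reading; the function name and the sample frequencies are made up for illustration and are not SEDRA data.

```python
from typing import Sequence


def estimated_seconds_per_word(
    frequencies: Sequence[int],
    first: float = 5.324,       # overall mean, 1st analysis (Table 1)
    subsequent: float = 0.054,  # overall mean, subsequent analysis (Table 1)
) -> float:
    """Estimated mean analysis time per running word of a corpus.

    frequencies[i] is the number of occurrences of the i-th distinct word.
    """
    n = len(frequencies)
    total = sum(frequencies)
    return (first * n + sum(subsequent * (f - 1) for f in frequencies)) / total


# Hypothetical corpus of three distinct words occurring 10, 5 and 1 times.
p = estimated_seconds_per_word([10, 5, 1])
words_per_second = 1 / p
```

The more often each word recurs, the closer the estimate gets to the subsequent-analysis time, which is why a corpus of very frequent words is analysed much faster than the first-analysis figures alone would suggest.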

6 Conclusion 

This paper has presented a computational morphology system which is adequate for handling non-linear grammars. We are currently expanding the grammar to cover the whole of New Testament Syriac. One of our future goals is to optimise the Prolog implementation for speedy processing and to add debugging facilities along the lines of (Carter, 1995).

For useful results, a Semitic morphological analyser needs to interact with a syntactic parser in order to resolve ambiguities. Most non-vocalised strings give more than one solution, and some inflectional forms are homographs even if fully vocalised (e.g. in Syriac imperfect verbs: sing. 3rd masc. = plural 1st common, and sing. 3rd fem. = sing. 2nd masc.). We mentioned earlier the possibility of using TAGs.